Analyzing and Visualizing a Loan Dataset

by Alf Maglalang

Preliminary Wrangling

This loan data is from a lending company called Prosper. It was last updated 3/11/2014. It has 113937 records with 81 features, namely

ListingKey, ListingNumber, ListingCreationDate, CreditGrade, Term, LoanStatus, ClosedDate, BorrowerAPR, BorrowerRate, LenderYield, EstimatedEffectiveYield, EstimatedLoss, EstimatedReturn, ProsperRating (numeric), ProsperRating (Alpha), ProsperScore, ListingCategory (numeric), BorrowerState, Occupation, EmploymentStatus, EmploymentStatusDuration, IsBorrowerHomeowner, CurrentlyInGroup, GroupKey, DateCreditPulled, CreditScoreRangeLower, CreditScoreRangeUpper, FirstRecordedCreditLine, CurrentCreditLines, OpenCreditLines, TotalCreditLinespast7years, OpenRevolvingAccounts, OpenRevolvingMonthlyPayment, InquiriesLast6Months, TotalInquiries, CurrentDelinquencies, AmountDelinquent, DelinquenciesLast7Years, PublicRecordsLast10Years, PublicRecordsLast12Months, RevolvingCreditBalance, BankcardUtilization, AvailableBankcardCredit, TotalTrades, TradesNeverDelinquent (percentage), TradesOpenedLast6Months, DebtToIncomeRatio, IncomeRange, IncomeVerifiable, StatedMonthlyIncome, LoanKey, TotalProsperLoans, TotalProsperPaymentsBilled, OnTimeProsperPayments, ProsperPaymentsLessThanOneMonthLate, ProsperPaymentsOneMonthPlusLate, ProsperPrincipalBorrowed, ProsperPrincipalOutstanding, ScorexChangeAtTimeOfListing, LoanCurrentDaysDelinquent, LoanFirstDefaultedCycleNumber, LoanMonthsSinceOrigination, LoanNumber, LoanOriginalAmount, LoanOriginationDate, LoanOriginationQuarter, MemberKey, MonthlyLoanPayment, LP_CustomerPayments, LP_CustomerPrincipalPayments, LP_InterestandFees, LP_ServiceFees, LP_CollectionFees, LP_GrossPrincipalLoss, LP_NetPrincipalLoss, LP_NonPrincipalRecoverypayments, PercentFunded, Recommendations, InvestmentFromFriendsCount, InvestmentFromFriendsAmount, Investors

Please click on the following link to see the definitions of the data's features. Prosper Data Dictionary to Explain Dataset's Variables

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.

In [2]:
# pld will be short for prosperLoanData
pld = pd.read_csv('prosperLoanData.csv')
In [3]:
# show all of the columns
pd.options.display.max_columns = None
pld.head()
Out[3]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 NaN NaN NaN NaN NaN NaN 0 CO Other Self-employed 2.0 True True NaN 2007-08-26 18:41:46.780000000 640.0 659.0 2001-10-11 00:00:00 5.0 4.0 12.0 1 24.0 3.0 3.0 2.0 472.0 4.0 0.0 0.0 0.0 0.00 1500.0 11.0 0.81 0.0 0.17 $25,000-49,999 True 3083.333333 E33A3400205839220442E84 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 78 19141 9425 2007-09-12 00:00:00 Q3 2007 1F3E3376408759268057EDA 330.43 11396.14 9425.00 1971.14 -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 0.07960 0.0249 0.05470 6.0 A 7.0 2 CO Professional Employed 44.0 False False NaN 2014-02-27 08:28:14 680.0 699.0 1996-03-18 00:00:00 14.0 14.0 29.0 13 389.0 3.0 5.0 0.0 0.0 0.0 1.0 0.0 3989.0 0.21 10266.0 29.0 1.00 2.0 0.18 $50,000-74,999 True 6125.000000 9E3B37071505919926B1D82 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 134815 10000 2014-03-03 00:00:00 Q1 2014 1D13370546739025387B2F4 318.93 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 NaN NaN NaN NaN NaN NaN 0 GA Other Not available NaN False True 783C3371218786870A73D20 2007-01-02 14:09:10.060000000 480.0 499.0 2002-07-27 00:00:00 NaN NaN 3.0 0 0.0 0.0 1.0 1.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.06 Not displayed True 2083.333333 6954337960046817851BCB2 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 86 6466 3001 2007-01-17 00:00:00 Q1 2007 5F7033715035555618FA612 123.32 4186.63 3001.00 1185.63 -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 0.08490 0.0249 0.06000 6.0 A 9.0 16 GA Skilled Labor Employed 113.0 True False NaN 2012-10-22 11:02:32 800.0 819.0 1983-02-28 00:00:00 5.0 5.0 29.0 7 115.0 0.0 1.0 4.0 10056.0 14.0 0.0 0.0 1444.0 0.04 30754.0 26.0 0.76 0.0 0.15 $25,000-49,999 True 2875.000000 A0393664465886295619C51 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 16 77296 10000 2012-11-01 00:00:00 Q4 2012 9ADE356069835475068C6D2 321.45 5143.20 4091.09 1052.11 -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 0.18316 0.0925 0.09066 3.0 D 4.0 2 MN Executive Employed 44.0 True False NaN 2013-09-14 18:38:44 680.0 699.0 2004-02-20 00:00:00 19.0 19.0 49.0 6 220.0 1.0 9.0 0.0 0.0 0.0 0.0 0.0 6193.0 0.81 695.0 39.0 0.95 2.0 0.26 $100,000+ True 9583.333333 A180369302188889200689E 1.0 11.0 11.0 0.0 0.0 11000.0 9947.9 NaN 0 NaN 6 102670 15000 2013-09-20 00:00:00 Q3 2013 36CE356043264555721F06C 563.97 2819.85 1563.22 1256.63 -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20
In [4]:
pld.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss                        84853 non-null   float64
 12  EstimatedReturn                      84853 non-null   float64
 13  ProsperRating (numeric)              84853 non-null   float64
 14  ProsperRating (Alpha)                84853 non-null   object 
 15  ProsperScore                         84853 non-null   float64
 16  ListingCategory (numeric)            113937 non-null  int64  
 17  BorrowerState                        108422 non-null  object 
 18  Occupation                           110349 non-null  object 
 19  EmploymentStatus                     111682 non-null  object 
 20  EmploymentStatusDuration             106312 non-null  float64
 21  IsBorrowerHomeowner                  113937 non-null  bool   
 22  CurrentlyInGroup                     113937 non-null  bool   
 23  GroupKey                             13341 non-null   object 
 24  DateCreditPulled                     113937 non-null  object 
 25  CreditScoreRangeLower                113346 non-null  float64
 26  CreditScoreRangeUpper                113346 non-null  float64
 27  FirstRecordedCreditLine              113240 non-null  object 
 28  CurrentCreditLines                   106333 non-null  float64
 29  OpenCreditLines                      106333 non-null  float64
 30  TotalCreditLinespast7years           113240 non-null  float64
 31  OpenRevolvingAccounts                113937 non-null  int64  
 32  OpenRevolvingMonthlyPayment          113937 non-null  float64
 33  InquiriesLast6Months                 113240 non-null  float64
 34  TotalInquiries                       112778 non-null  float64
 35  CurrentDelinquencies                 113240 non-null  float64
 36  AmountDelinquent                     106315 non-null  float64
 37  DelinquenciesLast7Years              112947 non-null  float64
 38  PublicRecordsLast10Years             113240 non-null  float64
 39  PublicRecordsLast12Months            106333 non-null  float64
 40  RevolvingCreditBalance               106333 non-null  float64
 41  BankcardUtilization                  106333 non-null  float64
 42  AvailableBankcardCredit              106393 non-null  float64
 43  TotalTrades                          106393 non-null  float64
 44  TradesNeverDelinquent (percentage)   106393 non-null  float64
 45  TradesOpenedLast6Months              106393 non-null  float64
 46  DebtToIncomeRatio                    105383 non-null  float64
 47  IncomeRange                          113937 non-null  object 
 48  IncomeVerifiable                     113937 non-null  bool   
 49  StatedMonthlyIncome                  113937 non-null  float64
 50  LoanKey                              113937 non-null  object 
 51  TotalProsperLoans                    22085 non-null   float64
 52  TotalProsperPaymentsBilled           22085 non-null   float64
 53  OnTimeProsperPayments                22085 non-null   float64
 54  ProsperPaymentsLessThanOneMonthLate  22085 non-null   float64
 55  ProsperPaymentsOneMonthPlusLate      22085 non-null   float64
 56  ProsperPrincipalBorrowed             22085 non-null   float64
 57  ProsperPrincipalOutstanding          22085 non-null   float64
 58  ScorexChangeAtTimeOfListing          18928 non-null   float64
 59  LoanCurrentDaysDelinquent            113937 non-null  int64  
 60  LoanFirstDefaultedCycleNumber        16952 non-null   float64
 61  LoanMonthsSinceOrigination           113937 non-null  int64  
 62  LoanNumber                           113937 non-null  int64  
 63  LoanOriginalAmount                   113937 non-null  int64  
 64  LoanOriginationDate                  113937 non-null  object 
 65  LoanOriginationQuarter               113937 non-null  object 
 66  MemberKey                            113937 non-null  object 
 67  MonthlyLoanPayment                   113937 non-null  float64
 68  LP_CustomerPayments                  113937 non-null  float64
 69  LP_CustomerPrincipalPayments         113937 non-null  float64
 70  LP_InterestandFees                   113937 non-null  float64
 71  LP_ServiceFees                       113937 non-null  float64
 72  LP_CollectionFees                    113937 non-null  float64
 73  LP_GrossPrincipalLoss                113937 non-null  float64
 74  LP_NetPrincipalLoss                  113937 non-null  float64
 75  LP_NonPrincipalRecoverypayments      113937 non-null  float64
 76  PercentFunded                        113937 non-null  float64
 77  Recommendations                      113937 non-null  int64  
 78  InvestmentFromFriendsCount           113937 non-null  int64  
 79  InvestmentFromFriendsAmount          113937 non-null  float64
 80  Investors                            113937 non-null  int64  
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB
In [5]:
pld.shape
Out[5]:
(113937, 81)
In [6]:
pld.describe()
Out[6]:
ListingNumber Term BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperScore ListingCategory (numeric) EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio StatedMonthlyIncome TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
count 1.139370e+05 113937.000000 113912.000000 113937.000000 113937.000000 84853.000000 84853.000000 84853.000000 84853.000000 84853.000000 113937.000000 106312.000000 113346.000000 113346.000000 106333.000000 106333.000000 113240.000000 113937.00000 113937.000000 113240.000000 112778.000000 113240.000000 106315.000000 112947.000000 113240.000000 106333.000000 1.063330e+05 106333.000000 106393.000000 106393.000000 106393.000000 106393.000000 105383.000000 1.139370e+05 22085.000000 22085.000000 22085.000000 22085.000000 22085.000000 22085.000000 22085.000000 18928.000000 113937.000000 16952.000000 113937.000000 113937.000000 113937.00000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000
mean 6.278857e+05 40.830248 0.218828 0.192764 0.182701 0.168661 0.080306 0.096068 4.072243 5.950067 2.774209 96.071582 685.567731 704.567731 10.317192 9.260164 26.754539 6.96979 398.292161 1.435085 5.584405 0.592052 984.507059 4.154984 0.312646 0.015094 1.759871e+04 0.561309 11210.225447 23.230034 0.885897 0.802327 0.275947 5.608026e+03 1.421100 22.934345 22.271949 0.613629 0.048540 8472.311961 2930.313906 -3.223214 152.816539 16.268464 31.896882 69444.474271 8337.01385 272.475783 4183.079489 3105.536588 1077.542901 -54.725641 -14.242698 700.446342 681.420499 25.142686 0.998584 0.048027 0.023460 16.550751 80.475228
std 3.280762e+05 10.436212 0.080364 0.074818 0.074516 0.068467 0.046764 0.030403 1.673227 2.376501 3.996797 94.480605 66.458275 66.458275 5.457866 5.022644 13.637871 4.63097 447.159711 2.437507 6.429946 1.978707 7158.270157 10.160216 0.727868 0.154092 3.293640e+04 0.317918 19818.361309 11.871311 0.148179 1.097637 0.551759 7.478497e+03 0.764042 19.249584 18.830425 2.446827 0.556285 7395.507650 3806.635075 50.063567 466.320254 9.005898 29.974184 38930.479610 6245.80058 192.697812 4790.907234 4069.527670 1183.414168 60.675425 109.232758 2388.513831 2357.167068 275.657937 0.017919 0.332353 0.232412 294.545422 103.239020
min 4.000000e+00 12.000000 0.006530 0.000000 -0.010000 -0.182700 0.004900 -0.182700 1.000000 1.000000 0.000000 0.000000 0.000000 19.000000 0.000000 0.000000 2.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -209.000000 0.000000 0.000000 0.000000 1.000000 1000.00000 0.000000 -2.349900 0.000000 -2.349900 -664.870000 -9274.750000 -94.200000 -954.550000 0.000000 0.700000 0.000000 0.000000 0.000000 1.000000
25% 4.009190e+05 36.000000 0.156290 0.134000 0.124200 0.115670 0.042400 0.074080 3.000000 4.000000 1.000000 26.000000 660.000000 679.000000 7.000000 6.000000 17.000000 4.00000 114.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.121000e+03 0.310000 880.000000 15.000000 0.820000 0.000000 0.140000 3.200333e+03 1.000000 9.000000 9.000000 0.000000 0.000000 3500.000000 0.000000 -35.000000 0.000000 9.000000 6.000000 37332.000000 4000.00000 131.620000 1005.760000 500.890000 274.870000 -73.180000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 2.000000
50% 6.005540e+05 36.000000 0.209760 0.184000 0.173000 0.161500 0.072400 0.091700 4.000000 6.000000 1.000000 67.000000 680.000000 699.000000 10.000000 9.000000 25.000000 6.00000 271.000000 1.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.549000e+03 0.600000 4100.000000 22.000000 0.940000 0.000000 0.220000 4.666667e+03 1.000000 16.000000 15.000000 0.000000 0.000000 6000.000000 1626.550000 -3.000000 0.000000 14.000000 21.000000 68599.000000 6500.00000 217.740000 2583.830000 1587.500000 700.840100 -34.440000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 44.000000
75% 8.926340e+05 36.000000 0.283810 0.250000 0.240000 0.224300 0.112000 0.116600 5.000000 8.000000 3.000000 137.000000 720.000000 739.000000 13.000000 12.000000 35.000000 9.00000 525.000000 2.000000 7.000000 0.000000 0.000000 3.000000 0.000000 0.000000 1.952100e+04 0.840000 13180.000000 30.000000 1.000000 1.000000 0.320000 6.825000e+03 2.000000 33.000000 32.000000 0.000000 0.000000 11000.000000 4126.720000 25.000000 0.000000 22.000000 65.000000 101901.000000 12000.00000 371.580000 5548.400000 4000.000000 1458.540000 -13.920000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 115.000000
max 1.255725e+06 60.000000 0.512290 0.497500 0.492500 0.319900 0.366000 0.283700 7.000000 11.000000 20.000000 755.000000 880.000000 899.000000 59.000000 54.000000 136.000000 51.00000 14985.000000 105.000000 379.000000 83.000000 463881.000000 99.000000 38.000000 20.000000 1.435667e+06 5.950000 646285.000000 126.000000 1.000000 20.000000 10.010000 1.750003e+06 8.000000 141.000000 141.000000 42.000000 21.000000 72499.000000 23450.950000 286.000000 2704.000000 44.000000 100.000000 136486.000000 35000.00000 2251.510000 40702.390000 35000.000000 15617.030000 32.060000 0.000000 25000.000000 25000.000000 21117.900000 1.012500 39.000000 33.000000 25000.000000 1189.000000
In [7]:
#features_corr = ['Term', 'LoanStatus', 'BorrowerRate', 'LenderYield', 'ProsperRating (numeric)', 'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines', 'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'AvailableBankcardCredit', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanOriginalAmount', 'MonthlyLoanPayment']
features_corr = ['Term', 'LoanStatus', 'BorrowerRate', 'ProsperRating (numeric)', 'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'InquiriesLast6Months', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'AvailableBankcardCredit', 'DebtToIncomeRatio', 'IncomeRange', 'StatedMonthlyIncome', 'LoanOriginalAmount']
temp2_df = pld[features_corr]
corr_big = temp2_df.corr()
corr_big
Out[7]:
Term BorrowerRate ProsperRating (numeric) ProsperScore ListingCategory (numeric) EmploymentStatusDuration IsBorrowerHomeowner CreditScoreRangeUpper CurrentCreditLines InquiriesLast6Months CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast12Months RevolvingCreditBalance AvailableBankcardCredit DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
Term 1.000000 0.020085 0.079179 0.028947 0.004947 0.082476 0.085339 0.126263 0.077072 -0.113568 -0.083807 -0.016459 -0.041492 -0.026252 0.026022 0.015348 -0.014670 0.028479 0.338927
BorrowerRate 0.020085 1.000000 -0.953105 -0.649736 0.102913 -0.019907 -0.134431 -0.461567 -0.097473 0.183810 0.176530 0.065645 0.170279 0.051169 -0.059608 -0.343861 0.062917 -0.088982 -0.328960
ProsperRating (numeric) 0.079179 -0.953105 1.000000 0.705221 -0.094474 0.036073 0.136470 0.548874 0.092377 -0.264305 -0.145205 -0.051071 -0.147149 -0.035791 0.062068 0.374432 -0.135344 0.094306 0.428557
ProsperScore 0.028947 -0.649736 0.705221 1.000000 -0.009718 -0.007302 0.064438 0.369603 -0.008497 -0.296762 -0.100612 -0.041601 -0.097755 -0.014885 0.042417 0.318558 -0.145336 0.083757 0.266293
ListingCategory (numeric) 0.004947 0.102913 -0.094474 -0.009718 1.000000 -0.005544 -0.038224 0.102400 -0.105035 -0.072644 -0.049936 0.022202 0.016950 0.003167 -0.050053 -0.031517 -0.042754 0.001599 -0.127130
EmploymentStatusDuration 0.082476 -0.019907 0.036073 -0.007302 -0.005544 1.000000 0.171046 0.081134 0.140493 -0.048024 -0.009078 0.008185 0.012221 -0.003918 0.118320 0.053008 -0.011609 0.069830 0.098149
IsBorrowerHomeowner 0.085339 -0.134431 0.136470 0.064438 -0.038224 0.171046 1.000000 0.293516 0.278295 0.006893 -0.055454 0.038122 -0.070798 -0.015016 0.224649 0.142039 0.000177 0.140051 0.206787
CreditScoreRangeUpper 0.126263 -0.461567 0.548874 0.369603 0.102400 0.081134 0.293516 1.000000 0.142092 -0.262135 -0.368450 -0.065849 -0.251219 -0.083443 0.088848 0.453257 -0.013169 0.107901 0.340874
CurrentCreditLines 0.077072 -0.097473 0.092377 -0.008497 -0.105035 0.140493 0.278295 0.142092 1.000000 0.069662 -0.165860 -0.072590 -0.180398 -0.059379 0.333656 0.318165 0.091878 0.143329 0.201383
InquiriesLast6Months -0.113568 0.183810 -0.264305 -0.296762 -0.072644 -0.048024 0.006893 -0.262135 0.069662 1.000000 0.156342 0.023969 0.090329 0.048873 -0.007397 -0.004564 0.024436 0.014854 -0.102878
CurrentDelinquencies -0.083807 0.176530 -0.145205 -0.100612 -0.049936 -0.009078 -0.055454 -0.368450 -0.165860 0.156342 1.000000 0.340549 0.377777 0.111661 -0.088831 -0.092433 -0.024265 -0.038835 -0.172139
AmountDelinquent -0.016459 0.065645 -0.051071 -0.041601 0.022202 0.008185 0.038122 -0.065849 -0.072590 0.023969 0.340549 1.000000 0.233270 0.041158 -0.021869 -0.020286 -0.019397 0.008662 -0.038756
DelinquenciesLast7Years -0.041492 0.170279 -0.147149 -0.097755 0.016950 0.012221 -0.070798 -0.251219 -0.180398 0.090329 0.377777 0.233270 1.000000 0.083850 -0.125917 -0.134486 -0.043877 -0.025805 -0.136667
PublicRecordsLast12Months -0.026252 0.051169 -0.035791 -0.014885 0.003167 -0.003918 -0.015016 -0.083443 -0.059379 0.048873 0.111661 0.041158 0.083850 1.000000 -0.017092 -0.027697 -0.008151 0.000422 -0.042798
RevolvingCreditBalance 0.026022 -0.059608 0.062068 0.042417 -0.050053 0.118320 0.224649 0.088848 0.333656 -0.007397 -0.088831 -0.021869 -0.125917 -0.017092 1.000000 0.239789 0.038636 0.194898 0.190879
AvailableBankcardCredit 0.015348 -0.343861 0.374432 0.318558 -0.031517 0.053008 0.142039 0.453257 0.318165 -0.004564 -0.092433 -0.020286 -0.134486 -0.027697 0.239789 1.000000 0.002059 0.111869 0.229866
DebtToIncomeRatio -0.014670 0.062917 -0.135344 -0.145336 -0.042754 -0.011609 0.000177 -0.013169 0.091878 0.024436 -0.024265 -0.019397 -0.043877 -0.008151 0.038636 0.002059 1.000000 -0.122659 0.010112
StatedMonthlyIncome 0.028479 -0.088982 0.094306 0.083757 0.001599 0.069830 0.140051 0.107901 0.143329 0.014854 -0.038835 0.008662 -0.025805 0.000422 0.194898 0.111869 -0.122659 1.000000 0.201259
LoanOriginalAmount 0.338927 -0.328960 0.428557 0.266293 -0.127130 0.098149 0.206787 0.340874 0.201383 -0.102878 -0.172139 -0.038756 -0.136667 -0.042798 0.190879 0.229866 0.010112 0.201259 1.000000
from seaborn import heatmap fig, ax = plt.subplots(figsize=(8,16)) ax = heatmap(corr_big, fmt='.2f', annot=True, ax=ax, cmap='RdBu_r') fig.savefig('corr.png', dpi=300, bbox_inches='tight');
In [8]:
pld.EmploymentStatus.value_counts()
Out[8]:
Employed         67322
Full-time        26355
Self-employed     6134
Not available     5347
Other             3806
Part-time         1088
Not employed       835
Retired            795
Name: EmploymentStatus, dtype: int64
In [9]:
print(list(pld.columns))
['ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade', 'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)', 'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState', 'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey', 'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines', 'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'BankcardUtilization', 'AvailableBankcardCredit', 'TotalTrades', 'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans', 'TotalProsperPaymentsBilled', 'OnTimeProsperPayments', 'ProsperPaymentsLessThanOneMonthLate', 'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing', 'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber', 'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount', 'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey', 'MonthlyLoanPayment', 'LP_CustomerPayments', 'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees', 'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss', 'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount', 'Investors']
In [10]:
pld.ListingKey.nunique()
Out[10]:
113066
In [11]:
# there are 113937 records and 113066 unique listing keys. I will verify quickly if these are duplicates.
key_counts = pld.ListingKey.value_counts()
key_counts
Out[11]:
17A93590655669644DB4C06    6
349D3587495831350F0F648    4
8474358854651984137201C    4
DE8535960513435199406CE    4
47C1359638497431975670B    4
                          ..
133F35105976906619F2021    1
FFE435529085737521DC57E    1
BEA635239618896479B169A    1
13A63586265117534D60796    1
E67D3552823846562E40B23    1
Name: ListingKey, Length: 113066, dtype: int64
In [12]:
key_counts.value_counts()
Out[12]:
1    112239
2       790
3        32
4         4
6         1
Name: ListingKey, dtype: int64
In [13]:
dup_idx = pld[pld.ListingKey.duplicated()]['ListingKey']
print(type(dup_idx))
print(dup_idx[:10])
<class 'pandas.core.series.Series'>
9       0F043596202561788EA13D5
999     0A0635972629771021E38F3
2539    2D2635984503681189056B4
4942    4B7E3590031274113F1FD34
5812    94B035909835592957503E6
5971    2EC635885737984509D5366
6385    083235967773321272C26B9
7415    5A2F35875117483773421B0
7538    09AD35918712001025AC1BD
7555    021F3593082999771F5E621
Name: ListingKey, dtype: object
In [14]:
tt = dup_idx.tolist()
tt
len(tt)
Out[14]:
871
In [15]:
dup_listing_idx = set(dup_idx.tolist())
len(dup_listing_idx)
Out[15]:
827

Note:

According to Prosper variable definitions, ListingKey is supposed to be UNIQUE. dup_listing_idx are the listing keys that have duplicates. I will remove their duplicates but keep the first. I will then assign a -1 in their ProsperScore to mark them as to be "corrected"

In [16]:
sub_cols = ['ProsperScore', 'ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade', 'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)', 'ListingCategory (numeric)', 'BorrowerState', 'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey', 'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines', 'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'BankcardUtilization', 'AvailableBankcardCredit', 'TotalTrades', 'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans', 'TotalProsperPaymentsBilled', 'OnTimeProsperPayments', 'ProsperPaymentsLessThanOneMonthLate', 'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing', 'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber', 'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount', 'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey', 'MonthlyLoanPayment', 'LP_CustomerPayments', 'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees', 'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss', 'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount', 'Investors']
In [17]:
# examine one ListingKey and see in which column the listings differ
check_dups_df = pld[pld.ListingKey == '17A93590655669644DB4C06']
# make sure that it is NOT spacing that makes a difference in non-duplicates
check_dups_df = check_dups_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
#duplicaterows = check_dups_df[check_dups_df.duplicated(sub_cols)] 
#temp_df.to_csv('dupes.csv', index=None)
In [18]:
check_dups_df
Out[18]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
13078 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 NaN 60 Current NaN 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5.0 B 4.0 1 MD Other Employed 26.0 False False NaN 2013-12-23 09:38:12 720.0 739.0 1986-12-26 00:00:00 12.0 12.0 20.0 6 348.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 14635.0 0.57 10865.0 17.0 1.0 0.0 0.41 $25,000-49,999 True 3000.0 CB1B37030986463208432A1 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0.0 0.0 0.0 0.0 1.0 0 0 0.0 96
14888 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 NaN 60 Current NaN 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5.0 B 8.0 1 MD Other Employed 26.0 False False NaN 2013-12-23 09:38:12 720.0 739.0 1986-12-26 00:00:00 12.0 12.0 20.0 6 348.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 14635.0 0.57 10865.0 17.0 1.0 0.0 0.41 $25,000-49,999 True 3000.0 CB1B37030986463208432A1 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0.0 0.0 0.0 0.0 1.0 0 0 0.0 96
20569 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 NaN 60 Current NaN 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5.0 B 7.0 1 MD Other Employed 26.0 False False NaN 2013-12-23 09:38:12 720.0 739.0 1986-12-26 00:00:00 12.0 12.0 20.0 6 348.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 14635.0 0.57 10865.0 17.0 1.0 0.0 0.41 $25,000-49,999 True 3000.0 CB1B37030986463208432A1 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0.0 0.0 0.0 0.0 1.0 0 0 0.0 96
31450 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 NaN 60 Current NaN 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5.0 B 10.0 1 MD Other Employed 26.0 False False NaN 2013-12-23 09:38:12 720.0 739.0 1986-12-26 00:00:00 12.0 12.0 20.0 6 348.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 14635.0 0.57 10865.0 17.0 1.0 0.0 0.41 $25,000-49,999 True 3000.0 CB1B37030986463208432A1 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0.0 0.0 0.0 0.0 1.0 0 0 0.0 96
42750 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 NaN 60 Current NaN 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5.0 B 5.0 1 MD Other Employed 26.0 False False NaN 2013-12-23 09:38:12 720.0 739.0 1986-12-26 00:00:00 12.0 12.0 20.0 6 348.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 14635.0 0.57 10865.0 17.0 1.0 0.0 0.41 $25,000-49,999 True 3000.0 CB1B37030986463208432A1 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0.0 0.0 0.0 0.0 1.0 0 0 0.0 96
42751 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 NaN 60 Current NaN 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5.0 B 6.0 1 MD Other Employed 26.0 False False NaN 2013-12-23 09:38:12 720.0 739.0 1986-12-26 00:00:00 12.0 12.0 20.0 6 348.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 14635.0 0.57 10865.0 17.0 1.0 0.0 0.41 $25,000-49,999 True 3000.0 CB1B37030986463208432A1 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0.0 0.0 0.0 0.0 1.0 0 0 0.0 96
In [19]:
temp_cols = ['ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade', 'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)', 'ListingCategory (numeric)', 'BorrowerState', 'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey', 'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines', 'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'BankcardUtilization', 'AvailableBankcardCredit', 'TotalTrades', 'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans', 'TotalProsperPaymentsBilled', 'OnTimeProsperPayments', 'ProsperPaymentsLessThanOneMonthLate', 'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing', 'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber', 'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount', 'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey', 'MonthlyLoanPayment', 'LP_CustomerPayments', 'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees', 'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss', 'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount', 'Investors']
In [20]:
# Exclude the column ProsperScore, then see if the duplicates are dropped
check_dups_df.drop_duplicates(subset=temp_cols, inplace=True, keep='first')
check_dups_df
Out[20]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
13078 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 NaN 60 Current NaN 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5.0 B 4.0 1 MD Other Employed 26.0 False False NaN 2013-12-23 09:38:12 720.0 739.0 1986-12-26 00:00:00 12.0 12.0 20.0 6 348.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 14635.0 0.57 10865.0 17.0 1.0 0.0 0.41 $25,000-49,999 True 3000.0 CB1B37030986463208432A1 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0.0 0.0 0.0 0.0 1.0 0 0 0.0 96

Notes:

The rows above were identical except in column ProsperScore. Although initially since they were inconsistently assigned, I thought they should be dropped. I think now I will take the opportunity to see if I can correct these "errors"

In [21]:
pld.corr()
Out[21]:
ListingNumber Term BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperScore ListingCategory (numeric) EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup CreditScoreRangeLower CreditScoreRangeUpper CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeVerifiable StatedMonthlyIncome TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
ListingNumber 1.000000 0.319971 -0.031520 -0.114391 -0.113544 -0.217721 -0.239568 -0.461391 0.209200 -0.106235 0.127062 0.156224 0.076745 -0.439653 0.269395 0.269395 0.145621 0.195707 0.130369 0.232109 0.184101 -0.298941 -0.280228 -0.233326 -0.024627 -0.078336 -0.045467 -0.073151 0.057056 0.086053 0.036690 0.134729 0.254659 -0.084531 -0.041810 -0.024027 0.076113 0.174834 0.131672 0.132063 0.020873 -0.005101 0.252583 0.164248 -0.126409 -0.370226 -0.155062 -0.939179 0.995095 0.340360 0.277267 -0.380814 -0.366555 -0.281170 0.208286 0.104191 -0.251739 -0.247716 -0.096004 0.014688 -0.126031 -0.092813 -0.053481 -0.279659
Term 0.319971 1.000000 -0.011183 0.020085 0.020556 0.057047 -0.107140 0.152505 0.079179 0.028947 0.004947 0.082476 0.085339 -0.143751 0.126263 0.126263 0.077072 0.096309 0.076528 0.093637 0.084049 -0.113568 -0.103132 -0.083807 -0.016459 -0.041492 -0.029512 -0.026252 0.026022 0.031535 0.015348 0.079650 0.119342 -0.029476 -0.014670 0.040402 0.028479 0.070311 0.002462 0.003147 -0.002912 -0.008328 0.130493 0.153372 -0.048157 -0.134134 -0.070819 -0.342130 0.335610 0.338927 0.091026 -0.069696 -0.128255 0.158890 -0.167600 0.035841 -0.038353 -0.035981 -0.034157 -0.028806 -0.056680 -0.043507 -0.024472 -0.045137
BorrowerAPR -0.031520 -0.011183 1.000000 0.989824 0.989329 0.895635 0.949538 0.794275 -0.962151 -0.668287 0.132456 -0.008589 -0.132823 -0.054567 -0.429707 -0.429707 -0.093481 -0.098860 0.002513 -0.110202 -0.043365 0.146119 0.114546 0.149404 0.065679 0.162225 0.122793 0.044095 -0.058539 0.261438 -0.348926 -0.041894 -0.241349 0.103527 0.056327 -0.109975 -0.082338 -0.041467 0.031399 0.018632 0.091034 0.055345 -0.206209 -0.090347 -0.231962 0.088032 -0.185956 -0.073773 -0.036269 -0.322887 -0.226653 -0.111350 -0.180445 0.169726 0.116974 -0.041452 0.067301 0.067385 0.028519 -0.028236 -0.044236 -0.046757 -0.031841 -0.307606
BorrowerRate -0.114391 0.020085 0.989824 1.000000 0.999211 0.895282 0.945297 0.817670 -0.953105 -0.649736 0.102913 -0.019907 -0.134431 -0.001964 -0.461567 -0.461567 -0.097473 -0.105998 -0.005793 -0.125987 -0.055628 0.183810 0.153129 0.176530 0.065645 0.170279 0.128314 0.051169 -0.059608 0.255482 -0.343861 -0.048211 -0.261189 0.114538 0.062917 -0.099540 -0.088982 -0.049270 0.019217 0.006996 0.085770 0.050770 -0.206461 -0.085685 -0.217155 0.136244 -0.172936 0.020420 -0.118721 -0.328960 -0.244742 -0.081055 -0.153722 0.200479 0.095993 -0.053378 0.100635 0.100308 0.042059 -0.029025 -0.030579 -0.037312 -0.026512 -0.274217
LenderYield -0.113544 0.020556 0.989329 0.999211 1.000000 0.895343 0.945308 0.817654 -0.953119 -0.649783 0.104395 -0.018954 -0.132993 -0.012683 -0.454350 -0.454350 -0.096318 -0.104790 -0.005470 -0.124609 -0.054718 0.180483 0.149915 0.171854 0.065021 0.168000 0.126399 0.049883 -0.058665 0.253755 -0.340939 -0.047249 -0.256283 0.113482 0.061942 -0.100159 -0.088208 -0.049514 0.018945 0.006724 0.085724 0.050774 -0.206725 -0.085790 -0.217056 0.131840 -0.170825 0.019155 -0.117959 -0.328455 -0.244214 -0.079052 -0.151563 0.201162 0.097854 -0.052954 0.099506 0.099195 0.041025 -0.029209 -0.031069 -0.037302 -0.026632 -0.274174
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
PercentFunded 0.014688 -0.028806 -0.028236 -0.029025 -0.029209 -0.043366 -0.022738 -0.039422 0.012491 -0.019344 -0.011688 0.000998 -0.005696 0.023310 -0.030432 -0.030432 0.008067 0.006984 -0.001784 -0.000285 -0.005245 0.012047 0.019360 0.005126 -0.005687 -0.004250 -0.002031 0.001362 -0.007931 0.014404 -0.009940 -0.002365 0.001990 0.012343 0.002319 0.011800 -0.012895 0.014314 0.012365 0.012247 0.001907 0.004893 -0.003301 -0.005053 0.003646 0.009926 0.009547 0.013106 0.018372 -0.010248 -0.008453 -0.051394 -0.041057 -0.066877 0.069591 0.006186 -0.013144 -0.013578 0.000776 1.000000 0.009604 0.001622 0.003338 -0.049716
Recommendations -0.126031 -0.056680 -0.044236 -0.030579 -0.031069 -0.046694 0.004081 -0.019886 0.004772 0.026658 -0.028031 -0.039404 -0.009585 0.133348 -0.035080 -0.035080 -0.011274 -0.021103 -0.007311 -0.011529 -0.021471 0.044251 0.054816 0.022532 0.016132 0.006495 0.004420 0.016083 -0.005133 -0.016093 0.020526 -0.016009 -0.052601 0.016995 0.033126 -0.024966 -0.018996 0.054959 0.055991 0.056652 0.005188 -0.002938 0.015278 -0.036709 0.058998 0.063427 0.039551 0.146977 -0.129119 -0.018190 -0.009982 0.080796 0.084628 0.036074 -0.039016 -0.015910 0.041372 0.040897 0.016845 0.009604 1.000000 0.718079 0.322062 0.073824
InvestmentFromFriendsCount -0.092813 -0.043507 -0.046757 -0.037312 -0.037302 -0.063848 0.009475 -0.043022 0.007746 0.028343 -0.018968 -0.031432 -0.010382 0.100205 -0.013162 -0.013162 -0.013340 -0.018608 -0.016086 -0.008483 -0.012182 0.023588 0.029128 0.013338 0.007141 -0.001832 -0.003912 0.008314 -0.000434 -0.021871 0.027884 -0.021967 -0.032957 0.007739 0.034170 -0.030997 -0.012938 -0.000098 -0.014361 -0.012824 -0.012119 -0.009479 -0.009109 -0.018161 0.028406 0.030843 0.034885 0.110114 -0.096856 -0.007509 -0.002857 0.075496 0.080778 0.027858 -0.035004 -0.008319 0.020932 0.020350 0.010109 0.001622 0.718079 1.000000 0.483506 0.058773
InvestmentFromFriendsAmount -0.053481 -0.024472 -0.031841 -0.026512 -0.026632 -0.038146 0.024555 -0.031902 -0.010459 0.003131 -0.013096 -0.020823 -0.007612 0.046906 -0.023009 -0.023009 -0.009026 -0.010141 -0.006842 -0.005228 0.002807 0.022732 0.023352 0.015324 0.005639 0.002011 -0.003679 0.000439 0.012884 -0.004563 0.004902 -0.011642 -0.030942 0.001836 0.027886 -0.029591 -0.005251 -0.003416 -0.010248 -0.012010 0.012935 -0.004933 0.002839 -0.008924 0.009081 0.023622 0.017697 0.065337 -0.056047 0.020351 0.026684 0.074770 0.075096 0.044460 -0.056675 -0.010141 0.028762 0.028301 0.018604 0.003338 0.322062 0.483506 1.000000 0.012178
Investors -0.279659 -0.045137 -0.307606 -0.274217 -0.274174 -0.265244 -0.275501 -0.092337 0.307753 0.320594 -0.073996 -0.041597 0.125093 0.094338 0.271350 0.271350 0.039449 0.025211 0.008226 0.049841 0.055534 0.013479 0.026334 -0.087091 -0.027039 -0.105479 -0.088231 -0.010938 0.072185 -0.158053 0.207695 0.023141 0.099087 -0.013929 0.004095 0.039935 0.076244 -0.035074 -0.056302 -0.051909 -0.039575 -0.017115 0.143124 0.044922 0.152169 0.081734 0.069875 0.258590 -0.283963 0.380093 0.387409 0.562607 0.551215 0.382127 -0.524836 -0.074550 0.219807 0.216049 0.066970 -0.049716 0.073824 0.058773 0.012178 1.000000

64 rows × 64 columns

In [22]:
# there is some strange inconsistency in way prosper give prosperscore.
# for example, the loan with listingkey 17A93590655669644DB4C06 has exactly the data in all the other columns
# EXCEPT for prosperScore which had 4, 8, 7, 10, 5, 6 in content. BIZARRE!!
# For that reason, I will drop the column prosperscore, and then drop_duplicates
len(temp_cols)
Out[22]:
80
In [23]:
#tpld = pld.drop(['ProsperScore'], axis=1)
tpld = pld.copy()
In [24]:
tpld.shape
Out[24]:
(113937, 81)
In [25]:
# trim spaces again to make sure it is not spaces that make a difference
tpld = tpld.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

Remove Duplicates

Note:

Before I remove the duplicated rows, I would like to record the range of prosperScores that the listing keys were assigned

In [26]:
# dup_listing_idx contains the keys that have duplicates: tpld.loc[dup_listing_idx]
lk_ps_df = tpld[tpld.ListingKey.isin(dup_listing_idx)][['ListingKey', 'ProsperScore']]
In [27]:
lk_ps_df = lk_ps_df.sort_values(by=['ListingKey', 'ProsperScore'])
lk_ps_df.count()
Out[27]:
ListingKey      1698
ProsperScore    1698
dtype: int64
In [28]:
# create a new column ProsperScoreRange and set index to ListingKey
lk_ps_df['ProsperScoreRange'] = None
lk_ps_df = lk_ps_df.set_index(['ListingKey'])
In [29]:
# create a dictionary with key as ListingKey and the ProsperScore range as value
psr_dict = {}
for lk in dup_listing_idx:
    psr_dict[lk] = lk_ps_df.loc[lk]['ProsperScore'].tolist()
list(psr_dict.items())[:10]
Out[29]:
[('C76236036432874664B9F44', [10.0, 11.0]),
 ('7427360187662132201019D', [5.0, 8.0]),
 ('C922359071767422500E22F', [3.0, 7.0]),
 ('47C1359638497431975670B', [1.0, 2.0, 3.0, 4.0]),
 ('484535954799474542E4235', [1.0, 4.0]),
 ('4A87360013176801912490C', [6.0, 7.0]),
 ('D95835933930307208F52C4', [3.0, 9.0]),
 ('2DB435933708696686E133B', [1.0, 2.0]),
 ('425B359641086354194E4FD', [2.0, 3.0]),
 ('46DF3585535716762DE94D4', [5.0, 7.0])]
In [30]:
# place all the new range value from the dictionary into dataframe
for k, v in psr_dict.items():
    vstr = [str(i) for i in v] # convert the list into strings
    yy = ",".join(vstr)
    lk_ps_df.loc[k, 'ProsperScoreRange'] = yy
In [31]:
lk_ps_df.head(10)
Out[31]:
ProsperScore ProsperScoreRange
ListingKey
00223594917038064A7C947 5.0 5.0,6.0
00223594917038064A7C947 6.0 5.0,6.0
00473590513960687DD308F 4.0 4.0,6.0
00473590513960687DD308F 6.0 4.0,6.0
0098360461900952056DB93 2.0 2.0,4.0
0098360461900952056DB93 4.0 2.0,4.0
01163604029146842E28D9C 2.0 2.0,3.0
01163604029146842E28D9C 3.0 2.0,3.0
014F35910923350802E1B29 4.0 4.0,7.0
014F35910923350802E1B29 7.0 4.0,7.0
In [32]:
# reset the index then drop the column ProsperScore from lk_ps_df
lk_ps_df.reset_index(inplace=True)
lk_ps_df = lk_ps_df.drop(['ProsperScore'], axis=1)
In [33]:
lk_ps_df.count()
Out[33]:
ListingKey           1698
ProsperScoreRange    1698
dtype: int64
In [34]:
# now remove all the duplicates on ListingKey
lk_ps_df.drop_duplicates(subset=['ListingKey'], inplace=True, keep='first')
In [35]:
lk_ps_df.count()
Out[35]:
ListingKey           827
ProsperScoreRange    827
dtype: int64

Note:

The count above matches the count of dup_listing_idx

In [36]:
# quickly examine the listing keys in my working dataframe tpld
# the temp_cols list is the list of columns without ProsperScore
duplicaterows = tpld[tpld.duplicated(temp_cols)]
duplicaterows.ListingKey.value_counts()
Out[36]:
17A93590655669644DB4C06    5
DE8535960513435199406CE    3
47C1359638497431975670B    3
8474358854651984137201C    3
349D3587495831350F0F648    3
                          ..
866C3592918867748EDDD1F    1
B3123597150262636A7D767    1
4AED359809095985758FB75    1
EBB63592797443079EBEAF5    1
C76236036432874664B9F44    1
Name: ListingKey, Length: 827, dtype: int64
In [37]:
# backup tpld
tpld_backup = tpld.copy()
In [38]:
# drop all the duplicates in tpld based on temp_cols subset (i.e., without ProsperScore)
tpld.drop_duplicates(subset=temp_cols, inplace=True, keep='first')
In [39]:
# merge the two df tpld and lk_ps_df on ListingKey
tpld = pd.merge(tpld, lk_ps_df, on=['ListingKey'], how='left')
tpld.head()
Out[39]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors ProsperScoreRange
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 NaN NaN NaN NaN NaN NaN 0 CO Other Self-employed 2.0 True True NaN 2007-08-26 18:41:46.780000000 640.0 659.0 2001-10-11 00:00:00 5.0 4.0 12.0 1 24.0 3.0 3.0 2.0 472.0 4.0 0.0 0.0 0.0 0.00 1500.0 11.0 0.81 0.0 0.17 $25,000-49,999 True 3083.333333 E33A3400205839220442E84 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 78 19141 9425 2007-09-12 00:00:00 Q3 2007 1F3E3376408759268057EDA 330.43 11396.14 9425.00 1971.14 -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258 NaN
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 0.07960 0.0249 0.05470 6.0 A 7.0 2 CO Professional Employed 44.0 False False NaN 2014-02-27 08:28:14 680.0 699.0 1996-03-18 00:00:00 14.0 14.0 29.0 13 389.0 3.0 5.0 0.0 0.0 0.0 1.0 0.0 3989.0 0.21 10266.0 29.0 1.00 2.0 0.18 $50,000-74,999 True 6125.000000 9E3B37071505919926B1D82 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 134815 10000 2014-03-03 00:00:00 Q1 2014 1D13370546739025387B2F4 318.93 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1 NaN
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 NaN NaN NaN NaN NaN NaN 0 GA Other Not available NaN False True 783C3371218786870A73D20 2007-01-02 14:09:10.060000000 480.0 499.0 2002-07-27 00:00:00 NaN NaN 3.0 0 0.0 0.0 1.0 1.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.06 Not displayed True 2083.333333 6954337960046817851BCB2 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 86 6466 3001 2007-01-17 00:00:00 Q1 2007 5F7033715035555618FA612 123.32 4186.63 3001.00 1185.63 -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41 NaN
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 0.08490 0.0249 0.06000 6.0 A 9.0 16 GA Skilled Labor Employed 113.0 True False NaN 2012-10-22 11:02:32 800.0 819.0 1983-02-28 00:00:00 5.0 5.0 29.0 7 115.0 0.0 1.0 4.0 10056.0 14.0 0.0 0.0 1444.0 0.04 30754.0 26.0 0.76 0.0 0.15 $25,000-49,999 True 2875.000000 A0393664465886295619C51 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 16 77296 10000 2012-11-01 00:00:00 Q4 2012 9ADE356069835475068C6D2 321.45 5143.20 4091.09 1052.11 -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158 NaN
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 0.18316 0.0925 0.09066 3.0 D 4.0 2 MN Executive Employed 44.0 True False NaN 2013-09-14 18:38:44 680.0 699.0 2004-02-20 00:00:00 19.0 19.0 49.0 6 220.0 1.0 9.0 0.0 0.0 0.0 0.0 0.0 6193.0 0.81 695.0 39.0 0.95 2.0 0.26 $100,000+ True 9583.333333 A180369302188889200689E 1.0 11.0 11.0 0.0 0.0 11000.0 9947.9 NaN 0 NaN 6 102670 15000 2013-09-20 00:00:00 Q3 2013 36CE356043264555721F06C 563.97 2819.85 1563.22 1256.63 -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20 NaN
In [40]:
new_key_counts = tpld.ListingKey.value_counts()
new_key_counts.value_counts()
Out[40]:
1    113066
Name: ListingKey, dtype: int64
In [41]:
tpld.shape
Out[41]:
(113066, 82)
In [42]:
tpld.ProsperScore.value_counts()
Out[42]:
4.0     12443
6.0     12161
8.0     11953
7.0     10489
5.0      9732
3.0      7565
9.0      6865
2.0      5706
10.0     4691
11.0     1431
1.0       946
Name: ProsperScore, dtype: int64
In [43]:
tpld['ProsperRating (Alpha)'].value_counts()
Out[43]:
C     18096
B     15368
A     14390
D     14170
E      9716
HR     6917
AA     5325
Name: ProsperRating (Alpha), dtype: int64
In [44]:
tpld['ProsperRating (numeric)'].value_counts()
Out[44]:
4.0    18096
5.0    15368
6.0    14390
3.0    14170
2.0     9716
1.0     6917
7.0     5325
Name: ProsperRating (numeric), dtype: int64
In [45]:
tpld['CreditGrade'].value_counts()
Out[45]:
C     5649
D     5153
B     4389
AA    3509
HR    3508
A     3315
E     3289
NC     141
Name: CreditGrade, dtype: int64

Notes:

ProsperRating alpha and numeric match exactly in count. CreditGrade has the same letter grades as ProsperRating. Prosper's variable definitions did not specify if the grading system in ProsperRating and CreditGrade. I am hoping that upon analysis below that it confirms that they are the same grading system.

In [46]:
pld.shape[0] - tpld.shape[0]
Out[46]:
871
In [47]:
#select_features = ['EmploymentStatus', 'LoanOriginalAmount', 'CreditScoreRangeUpper', 'LoanStatus', 'BorrowerRate', 'StatedMonthlyIncome', 'ProsperRating (numeric)', 'ListingCategory (numeric)', 'IsBorrowerHomeowner']
select_features = ['ListingKey', 'EmploymentStatusDuration', 'EmploymentStatus', 'LoanOriginalAmount', 'CreditGrade', 'ProsperRating (numeric)', 'ProsperScore', 'ProsperScoreRange', 'BorrowerRate', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'IsBorrowerHomeowner', 'CurrentCreditLines', 'AvailableBankcardCredit']
df = tpld[select_features]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 113066 entries, 0 to 113065
Data columns (total 14 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ListingKey                113066 non-null  object 
 1   EmploymentStatusDuration  105441 non-null  float64
 2   EmploymentStatus          110811 non-null  object 
 3   LoanOriginalAmount        113066 non-null  int64  
 4   CreditGrade               28953 non-null   object 
 5   ProsperRating (numeric)   83982 non-null   float64
 6   ProsperScore              83982 non-null   float64
 7   ProsperScoreRange         827 non-null     object 
 8   BorrowerRate              113066 non-null  float64
 9   CreditScoreRangeUpper     112475 non-null  float64
 10  CurrentDelinquencies      112369 non-null  float64
 11  IsBorrowerHomeowner       113066 non-null  bool   
 12  CurrentCreditLines        105462 non-null  float64
 13  AvailableBankcardCredit   105522 non-null  float64
dtypes: bool(1), float64(8), int64(1), object(4)
memory usage: 12.2+ MB
In [48]:
df.shape
Out[48]:
(113066, 14)
In [49]:
# There is a lot of nulls. what to do?
# drop listingnumber, borrowerapr (seems similar to borrowerrate), Investmentfromfriendscount, investmentfromfriendsamount,
# percentfunded (nearly 100%)
df.corr()
Out[49]:
EmploymentStatusDuration LoanOriginalAmount ProsperRating (numeric) ProsperScore BorrowerRate CreditScoreRangeUpper CurrentDelinquencies IsBorrowerHomeowner CurrentCreditLines AvailableBankcardCredit
EmploymentStatusDuration 1.000000 0.098219 0.036075 -0.008019 -0.019629 0.081551 -0.009143 0.171195 0.140589 0.052818
LoanOriginalAmount 0.098219 1.000000 0.429202 0.265919 -0.328339 0.341488 -0.172189 0.206859 0.200852 0.229751
ProsperRating (numeric) 0.036075 0.429202 1.000000 0.706682 -0.953102 0.549401 -0.145456 0.136271 0.092155 0.375025
ProsperScore -0.008019 0.265919 0.706682 1.000000 -0.651294 0.370050 -0.101107 0.063464 -0.009333 0.319400
BorrowerRate -0.019629 -0.328339 -0.953102 -0.651294 1.000000 -0.461342 0.176404 -0.134316 -0.097062 -0.344172
CreditScoreRangeUpper 0.081551 0.341488 0.549401 0.370050 -0.461342 1.000000 -0.368845 0.294119 0.141847 0.453301
CurrentDelinquencies -0.009143 -0.172189 -0.145456 -0.101107 0.176404 -0.368845 1.000000 -0.055865 -0.165982 -0.092643
IsBorrowerHomeowner 0.171195 0.206859 0.136271 0.063464 -0.134316 0.294119 -0.055865 1.000000 0.278930 0.142155
CurrentCreditLines 0.140589 0.200852 0.092155 -0.009333 -0.097062 0.141847 -0.165982 0.278930 1.000000 0.317422
AvailableBankcardCredit 0.052818 0.229751 0.375025 0.319400 -0.344172 0.453301 -0.092643 0.142155 0.317422 1.000000
In [50]:
df.describe()
Out[50]:
EmploymentStatusDuration LoanOriginalAmount ProsperRating (numeric) ProsperScore BorrowerRate CreditScoreRangeUpper CurrentDelinquencies CurrentCreditLines AvailableBankcardCredit
count 105441.000000 113066.000000 83982.000000 83982.000000 113066.000000 112475.000000 112369.000000 105462.000000 105522.000000
mean 96.060584 8314.762307 4.068705 5.952871 0.192946 704.524961 0.595129 10.307058 11208.182028
std 94.432241 6237.007841 1.675442 2.373897 0.074917 66.635895 1.984234 5.458196 19847.681185
min 0.000000 1000.000000 1.000000 1.000000 0.000000 19.000000 0.000000 0.000000 0.000000
25% 26.000000 4000.000000 3.000000 4.000000 0.134000 679.000000 0.000000 7.000000 876.000000
50% 67.000000 6300.000000 4.000000 6.000000 0.184000 699.000000 0.000000 10.000000 4090.000000
75% 137.000000 12000.000000 5.000000 8.000000 0.250600 739.000000 0.000000 13.000000 13165.000000
max 755.000000 35000.000000 7.000000 11.000000 0.497500 899.000000 83.000000 59.000000 646285.000000
In [51]:
df.isnull().sum()
Out[51]:
ListingKey                       0
EmploymentStatusDuration      7625
EmploymentStatus              2255
LoanOriginalAmount               0
CreditGrade                  84113
ProsperRating (numeric)      29084
ProsperScore                 29084
ProsperScoreRange           112239
BorrowerRate                     0
CreditScoreRangeUpper          591
CurrentDelinquencies           697
IsBorrowerHomeowner              0
CurrentCreditLines            7604
AvailableBankcardCredit       7544
dtype: int64
In [52]:
print(list(df.columns))
['ListingKey', 'EmploymentStatusDuration', 'EmploymentStatus', 'LoanOriginalAmount', 'CreditGrade', 'ProsperRating (numeric)', 'ProsperScore', 'ProsperScoreRange', 'BorrowerRate', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'IsBorrowerHomeowner', 'CurrentCreditLines', 'AvailableBankcardCredit']

What is the structure of your dataset?

The original dataset had (113937, 81). After discovering that Prosper had inconsistencies in using the feature ProsperScore, I added a new column called ProsperScoreRange. I then dropped all the duplicated rows but keeping the first among the duplicated. The new shape is (113937, 82). With the selected features, the final shape is (113066, 14)

What is/are the main feature(s) of interest in your dataset?

ProsperRating and BorrowerRate features are interesting. I would like to see if it's predictable based on some features. The BorrowerRate is highly correlated to ProsperRating so I believe BorrowerRate is derivative. I would like to keep track of these 2 features based on the features below.

Initially, I was interested in predicting ProsperRating. It is however highly correlative to BorrowerRate. But some samples do not have ProsperRating but they all have BorrowerRate. Ultimately if this is a automation or machine learning problem. If I choose to predict ProsperRating, it would be a classification problem. If I choose BorrowerRate, it would be a regression problem. Since it is not the purpose of this project, I will leave the automation problem for a future task.

I will attempt to correct the ProsperScore on those index that had duplicates in exploration.

Since ProsperRating and CreditGrade are mutually exclusive, I will also combine them in a new column. They both have BorrowerRate. So I will examine if they correlate with BorrowerRate.

I will examine in bivariate section how CreditGrade's relationship to BorrowerRate compares with that of PropserRating.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I will investigate the following features.

'ListingKey', 'EmploymentStatusDuration', 'EmploymentStatus', 'LoanOriginalAmount', 'CreditGrade', 'ProsperRating (numeric)', 'ProsperScore', 'ProsperScoreRange', 'BorrowerRate', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'IsBorrowerHomeowner', 'CurrentCreditLines', 'AvailableBankcardCredit'

And keep track of these 2 features against the features above. 'BorrowerRate', 'ProsperRating (numeric)'

I retained the ListingKey so I can continue to process below.

In [53]:
df = df.rename(columns={'ProsperRating (numeric)': 'ProsperRating'})

Univariate Exploration

In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.

In [54]:
# set a base color palette
pleasant = sb.color_palette()[0]
In [55]:
#fig, ax = plt.subplots(figsize=(10,8))
#df.EmploymentStatus.plot(kind='bar', ax=ax)
#plt.show();
plt.figure(figsize=(10,8))
sb.countplot(data = df, x='EmploymentStatus', color=pleasant)
plt.xticks(rotation=45, horizontalalignment='right')
plt.title('Employment Status Count');

Notes:

Although I will compare in bivariate analysis later the relationship between EmploymentStatus and EmloymentStatusDuration, I will do them here one-by-one first.

Nearly all of the borrowers are employed. Very few borrowers have "Not employed" status. It would be interesting to see what other factors Prosper used to extend loans to the unemployed borrowers.

In [56]:
plt.figure(figsize=(10,8))
sb.histplot(data=df, x='EmploymentStatusDuration', color=pleasant, bins=100)
plt.xlabel('Employment Duration in months')
plt.title('Employment Duration in months')
Out[56]:
Text(0.5, 1.0, 'Employment Duration in months')

Notes:

It is extremely right-skewed. Most people worked below 100 months in their current employment status

In [57]:
plt.figure(figsize=(10,8))
sb.histplot(data=df, x='EmploymentStatusDuration', color=pleasant, bins=100)
plt.xscale('log')
plt.xlabel('Employment Duration in months')
plt.title('Employment Duration in months');

Notes:

I am not sure if this transformation elucidated the right-skewedness of the distribution.

In [58]:
sb.countplot(data=df, x='ProsperRating', color=pleasant)
plt.title('ProsperRating Count');

Notes:

ProsperRating is normally distributed.

In [59]:
sb.boxplot(data=df, x='ProsperRating')
plt.title('Another view of the obvious');

Notes:

This strongly confirms the normal distribution.

In [60]:
sb.countplot(data=df, x='ProsperScore', color=pleasant)
plt.title('ProsperScore Count')
Out[60]:
Text(0.5, 1.0, 'ProsperScore Count')

Note:

Both ProsperScore and ProsperRating seem to be ordinal. It is not clear to me whether bigger score is better or worse. I will examine further. However, they both seem to have a normal distribution. The Prosper Loan Dictionary did not specify which one is best.

In [61]:
plt.figure(figsize=(10,8))
sb.histplot(data=df, x='BorrowerRate', color=pleasant)
plt.title('Distribution of BorrowerRate');

Notes:

This distribution is all over the place. There wild peaks around .15, .25 and .35, and a especially huge one around .32.

In [62]:
plt.figure(figsize=(10,8))
sb.distplot(x=df.BorrowerRate, color=pleasant)
plt.title('Distribution of Borrower Rate with kde');
/home/nemo/miniconda3/envs/ml/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

Notes:

There's a lot of humps here. multimodal. It seems most loans had borrower rate between .1 and .2. However, there are also many between .2 and .3 and a really huge hump just above .3.

In [63]:
fig, ax = plt.subplots(figsize=(12,10))
#, 
df.hist(['LoanOriginalAmount', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'CurrentCreditLines', 'AvailableBankcardCredit'], ax=ax);
<ipython-input-63-55d468fa913c>:3: UserWarning: To output multiple subplots, the figure containing the passed axes is being cleared
  df.hist(['LoanOriginalAmount', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'CurrentCreditLines', 'AvailableBankcardCredit'], ax=ax);

Notes:

This is a quick series of distribution plots:

  • Most LoanOriginalAmount are below 10k, with bumps at increments of 5000.
  • CreditScoreRangeUpper range from 450 to 900
  • Nearly all borrowers have CurrentDelinquencies below 10.
  • Most borrowers have CurrentCreditLines between 5 and 15.
  • Nearly all borrowers have AvailableBankcardCredit below 10k
In [64]:
plt.figure(figsize=(10,8))
sb.countplot(data=df, x='CurrentCreditLines', color=pleasant)
plt.xticks(rotation=90)
plt.title('Number of Current Credit Lines');

Notes:

Before 20, CurrentCreditLines looks to have a normal distribution.

In [65]:
plt.figure(figsize=(10,8))
sb.countplot(data=df, x='CurrentCreditLines', color=pleasant)
plt.xlim(0,20)
plt.xticks(rotation=90)
plt.title('Number of Current Credit Lines');

Notes:

It is slightly right-skewed with the mean just slightly higher than the median. There are some big outliers. It looks almost normal when I limited the x-axis.

In [66]:
print(f"median: {df.CurrentCreditLines.median()}, mean: {df.CurrentCreditLines.mean()}")
median: 10.0, mean: 10.307058466556674
In [67]:
plt.figure(figsize=(10,8))
sb.histplot(data=df, x='LoanOriginalAmount', color=pleasant, bins=100)
plt.title('Distribution of Loan Amounts');

Notes:

As I suspected above, the loan amounts had peaks around increments of 5000.

In [68]:
plt.figure(figsize=(10,8))
sb.distplot(x=df.LoanOriginalAmount, color=pleasant)
plt.title('Distribution of Loan Amount with kde');
/home/nemo/miniconda3/envs/ml/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

Notes:

I do not know what to make of this yet. The bumps look like the bank preferred to lend in increments of 5000. Big bump at 5000, then bumps at 10000, 15000, 20000, and 25000 (perhaps even at 30000 and 35000)

In [69]:
sb.countplot(data=df, x='IsBorrowerHomeowner', color=pleasant);

Notes:

There are about as many homeowners than not. This is nearly uniform.

In [70]:
df.IsBorrowerHomeowner.value_counts()
Out[70]:
True     57052
False    56014
Name: IsBorrowerHomeowner, dtype: int64
In [71]:
plt.figure(figsize=(10,8))
sb.histplot(data=df, x='CreditScoreRangeUpper', color=pleasant, bins=100)
plt.xlim(625,800)
plt.title('Distribution of Credit Scores');

Notes:

The credit score distribution looks almost normal with the mean and the median at around 700.

I will assume that CreditGrade has the same legend as ProsperRating which is

The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.

I will create a numeric column based on CreditGrade

Convert CreditGrade to number rating

In [72]:
df.CreditGrade.value_counts()
Out[72]:
C     5649
D     5153
B     4389
AA    3509
HR    3508
A     3315
E     3289
NC     141
Name: CreditGrade, dtype: int64
In [73]:
pr_dict = {'NC': 0, 'HR': 1, 'E': 2, 'D': 3, 'C': 4, 'B': 5, 'A': 6, 'AA': 7}
In [74]:
df.CreditGrade.replace(pr_dict, inplace=True)
In [75]:
plt.figure(figsize=(10,8))
sb.countplot(data = df, x='CreditGrade', color=pleasant)
#plt.xticks(rotation=45, horizontalalignment='right')
plt.title('Credit Grade Count');

Notes:

If we discount 0 rating, CreditGrade looks to be almost normal.

In [76]:
df[['CreditGrade', 'ProsperRating']].describe()
Out[76]:
CreditGrade ProsperRating
count 28953.000000 83982.000000
mean 3.956032 4.068705
std 1.870635 1.675442
min 0.000000 1.000000
25% 3.000000 3.000000
50% 4.000000 4.000000
75% 5.000000 5.000000
max 7.000000 7.000000

Notes:

If you discount, the 0 rating it looks normal. And the distribution is comparable to ProsperRating. This confirms that the alphabetic rating used in ProsperRating is the same as the alphabetic rating used in CreditGrade.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

My variable of interest the BorrowerRate since all the records have that. BorrowerRate has a lot of peaks. ProsperRating has a normal distribution. ProsperScore is not as normal as ProsperRating. Nevertheless, I will do some more cleaning below after I do some bivariate analysis. I attempted to do a transformation on employment duration. It confirmed that most borrowers worked at least 2 years.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

As I mentioned that there were some ListingKeys that had duplicates that only differed in their ProsperScore content. I believe these were clerical errors. I kept one of each duplicated Listing. But I created a new column called ProsperScoreRange so that I can examine the "incorrect" ProsperScores after I impute values into ProsperScore based on ProsperRating.

I converted the CreditGrade column to the numerical rating described in the description of ProsperRating in the Prosper's variable definitions. If you discount the 0 rating, CreditGrade looks normal. And the distribution is comparable to ProsperRating. This confirms that the alphabetic rating used in ProsperRating is the same as the alphabetic rating used in CreditGrade.

Bivariate Exploration

In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

In [77]:
# Now I will create a new column combining ProsperRating and CreditGrade since they are mutually exclusive
# But first I would like to confirm that CreditGrade's relationship to BorrowerRate is similar to that of ProsperRating.
sb.catplot(data=df, x='CreditGrade', y='BorrowerRate', color=pleasant, alpha=0.1);

Notes:

The relationship density between BorrowerRate and CreditGrade trends negatively.

In [78]:
sb.catplot(data=df, x='ProsperRating', y='BorrowerRate', color=pleasant, alpha=0.1);

Notes:

Similar to above, the relationship density between BorrowerRate and ProsperRating trends negatively.

In [79]:
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True)
sb.regplot(x='ProsperRating', y='BorrowerRate', ax=ax1, data=df)
sb.regplot(x='CreditGrade', y='BorrowerRate', ax=ax2, data=df)
Out[79]:
<AxesSubplot:xlabel='CreditGrade', ylabel='BorrowerRate'>

Notes:

Side-by-side, it is clearer that they are similary correlated to BorrowerRate. It warrants that they be combine into one column. Reminder that they are mutually exclusive CreditGrade (pre mid 2009), ProsperRating (post mid 2009).

In [80]:
sb.heatmap(df.corr(), fmt='.2f', annot=True, cmap='RdBu_r')
Out[80]:
<AxesSubplot:>

Notes:

ProsperRating and ProsperScore are positively correlative

ProsperScore and ProsperRating are highly negatively correlative with BorrowerRate. Since all loans have a BorrowerRate, I believe that these 3 features are co-determined based on all the other features.

The heatmap confirms that ProsperRating and CreditGrade are in fact mutually exclusive. Their graphs also are comparable agains BorrowerRate. I will now combine ProsperRating and CreditGrade in one column.

Combine CreditGrade and ProsperRating into one column

In [81]:
df[['CreditGrade', 'ProsperRating']].notnull().sum()
Out[81]:
CreditGrade      28953
ProsperRating    83982
dtype: int64
In [82]:
df['ProsperRatingGrade'] = df['ProsperRating']
In [83]:
df.ProsperRatingGrade.update(df['CreditGrade'])
df.ProsperRatingGrade.isnull().sum()
Out[83]:
131

Notes:

All 113066 records accounted for.

In [84]:
df[['CreditGrade', 'ProsperRating', 'ProsperRatingGrade']].sample(1000).head(20)
Out[84]:
CreditGrade ProsperRating ProsperRatingGrade
87941 NaN 4.0 4.0
53992 NaN 6.0 6.0
90736 NaN 4.0 4.0
51020 4.0 NaN 4.0
105203 NaN 5.0 5.0
3042 6.0 NaN 6.0
64469 NaN 5.0 5.0
43055 NaN 7.0 7.0
47415 NaN 3.0 3.0
104118 NaN 4.0 4.0
66051 NaN 1.0 1.0
39891 NaN 2.0 2.0
112527 3.0 NaN 3.0
20379 4.0 NaN 4.0
69334 NaN 2.0 2.0
28813 NaN 2.0 2.0
51847 4.0 NaN 4.0
60375 NaN 6.0 6.0
1752 2.0 NaN 2.0
33706 3.0 NaN 3.0

Notes:

I successfully combined the CreditGrade and ProsperScore. I will now do more bivariate analysis

In [85]:
sb.pairplot(df[['EmploymentStatusDuration', 'EmploymentStatus', 'LoanOriginalAmount', 'ProsperRatingGrade', 'BorrowerRate', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'IsBorrowerHomeowner', 'CurrentCreditLines', 'AvailableBankcardCredit']])
<__array_function__ internals>:5: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.
<__array_function__ internals>:5: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.
Out[85]:
<seaborn.axisgrid.PairGrid at 0x7f84ba0d16a0>

Notes:

This is not so readable. But this quickly gives me hints on how to plot the relationships better.

# code below to be used in exlanatory part fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12,10)) sb.regplot(y='BorrowerRate', x='CreditScore', data=df.sample(frac=.25), ax=ax1, line_kws={'color':'red'}) sb.regplot(y='BorrowerRate', x='LoanOriginalAmount', data=df.sample(frac=.25), ax=ax2, line_kws={'color':'red'}) plt.xlabel('Credit Scores') ax1.set_xlim(left=300) fig.suptitle('Borrower Rates against Credit Scores rating pre-2009 and post-2009');
In [86]:
sb.catplot(data=df, x='ProsperRatingGrade', y='BorrowerRate', color=pleasant, alpha=0.1);

Notes:

ProsperRatingGrade is the column that combines CreditGrade and ProsperRating. Obviously, the density and trend of relationship of each of those features against BorrowerRate.

In [87]:
sb.lmplot(data=df.sample(11000), x='ProsperRatingGrade', y='BorrowerRate', x_jitter=.05)
Out[87]:
<seaborn.axisgrid.FacetGrid at 0x7f84a54c9100>

Notes:

This shows that the higher the ProsperRatingGrade the lower BorrowerRate.

Correcting ProsperScore

In [88]:
sb.catplot(data=df, x='ProsperRating', y='ProsperScore', color=pleasant, alpha=0.009)
Out[88]:
<seaborn.axisgrid.FacetGrid at 0x7f84a545bcd0>

Notes:

Is it justifiable to use ProsperRating to correct ProsperScore? The density hints at a positive correlation.

In [89]:
sb.lmplot(data=df.sample(11000), x='ProsperRating', y='ProsperScore')
Out[89]:
<seaborn.axisgrid.FacetGrid at 0x7f849f038f10>

Notes:

The line shows a correspondence between ProsperRating and ProsperScore.

In [90]:
# what is the average score in each rating
# get a temporary df without the dup IDs
temp_df = df[~df.ListingKey.isin(dup_listing_idx)]
In [91]:
temp_df.shape
Out[91]:
(112239, 15)
In [94]:
fig, ax = plt.subplots(figsize=(10,8))
#plt.figure(figsize=(10,8))
sb.lineplot(data=temp_df, x='ProsperRating', y='ProsperScore', marker='o')
ax.set_yticks(np.arange(11))
plt.title('ProsperScore against ProsperRating');

Notes:

The graph did not really change much. But I will use the opportunity to "correct" ProsperScore based on ProsperRating to get to know the data better

In [93]:
grptemp = temp_df.groupby('ProsperRating')
In [94]:
grptemp.mean()
Out[94]:
EmploymentStatusDuration LoanOriginalAmount CreditGrade ProsperScore BorrowerRate CreditScoreRangeUpper CurrentDelinquencies IsBorrowerHomeowner CurrentCreditLines AvailableBankcardCredit ProsperRatingGrade
ProsperRating
1.0 94.697594 3461.917693 NaN 3.601942 0.317377 696.125054 0.587741 0.499348 9.575424 4365.688161 1.0
2.0 97.429713 4563.806826 NaN 3.707335 0.293678 681.587405 0.600581 0.425874 9.884220 4224.442473 2.0
3.0 99.074659 7064.762971 NaN 4.942928 0.246761 699.423596 0.422886 0.454016 9.808742 6043.630561 3.0
4.0 106.232066 10356.121095 NaN 5.477774 0.194795 709.124286 0.291401 0.510357 10.850521 8299.424029 4.0
5.0 108.572145 11579.995648 NaN 6.775551 0.154757 726.096136 0.233087 0.555255 10.781815 12063.486285 5.0
6.0 106.000632 11434.172087 NaN 8.037367 0.112972 749.370162 0.155510 0.614455 11.074103 18899.203062 6.0
7.0 102.123508 11547.407955 NaN 9.505303 0.079157 793.571970 0.053409 0.717803 10.923295 36323.288636 7.0
In [95]:
gm = grptemp.mean()
print(gm.ProsperScore.apply(np.ceil))
ProsperRating
1.0     4.0
2.0     4.0
3.0     5.0
4.0     6.0
5.0     7.0
6.0     9.0
7.0    10.0
Name: ProsperScore, dtype: float64
In [96]:
grptemp.median()
Out[96]:
EmploymentStatusDuration LoanOriginalAmount CreditGrade ProsperScore BorrowerRate CreditScoreRangeUpper CurrentDelinquencies IsBorrowerHomeowner CurrentCreditLines AvailableBankcardCredit ProsperRatingGrade
ProsperRating
1.0 66.0 4000 NaN 4.0 0.3177 699.0 0.0 False 9.0 1233.0 1.0
2.0 68.0 4000 NaN 3.0 0.2925 679.0 0.0 False 9.0 1492.0 2.0
3.0 70.0 6000 NaN 5.0 0.2492 699.0 0.0 False 9.0 2429.0 3.0
4.0 77.0 10000 NaN 5.0 0.1915 699.0 0.0 True 10.0 3908.5 4.0
5.0 79.5 10000 NaN 7.0 0.1509 719.0 0.0 True 10.0 6278.0 5.0
6.0 77.0 10000 NaN 8.0 0.1119 739.0 0.0 True 10.0 11629.0 6.0
7.0 73.0 10500 NaN 10.0 0.0779 799.0 0.0 True 10.0 28181.0 7.0

Note:

I will use the mean scores obtained above to "correct" the ProsperScore in those duplicated indices with varying ProsperScore. Although I do not believe that this corrects the ProsperScore, I will use it for now. I will try to confirm by other means if my simple "correction" based on ProsperRating approximates truth. I did save the original ProsperScores in a new column called ProsperScoreRange.

In [97]:
# work with grouped mean dataframe and reset index
# then create a dictionary with ProsperRating as key and ProsperScore as value from mean() dataframe
gm = gm.reset_index() # necessary to use ProsperRating again as a column
p_score = dict(zip(gm.ProsperRating, gm.ProsperScore.apply(np.ceil)))
p_score
Out[97]:
{1.0: 4.0, 2.0: 4.0, 3.0: 5.0, 4.0: 6.0, 5.0: 7.0, 6.0: 9.0, 7.0: 10.0}
In [98]:
#df = work_df.copy()
In [99]:
work_df = df.copy() # just another backup of working dataframe
In [100]:
# with working dataframe df, for each Listing key in dup_listing_idx
# change the ProsperScore to the corresponding one in the ps_score dictionary
# I tried the declarative way below. But I experimented too much. I will use procedural way instead.
#work_df.loc[work_df.ListingKey.isin(dup_listing_idx), 'ProsperScore'] = work_df.ProsperScore.map(p_score)
for idx in dup_listing_idx:
    pr = float(df[df.ListingKey==idx]['ProsperRating']) # get the ProsperRating to use for dict lookup
    df.loc[(df.ListingKey==idx), 'ProsperScore'] = p_score.get(pr)
In [101]:
#peek_df = work_df[work_df.ListingKey.eq('09233589620788733CFB8CE')]
peek_df = df[df.ListingKey.isin(dup_listing_idx)]
peek_df[['CreditGrade', 'ProsperRating', 'ProsperScore', 'ProsperRatingGrade', 'ProsperScoreRange']].head(20)
Out[101]:
CreditGrade ProsperRating ProsperScore ProsperRatingGrade ProsperScoreRange
8 NaN 7.0 10.0 7.0 9.0,11.0
28 NaN 6.0 9.0 6.0 5.0,6.0
175 NaN 2.0 4.0 2.0 2.0,3.0
312 NaN 5.0 7.0 5.0 2.0,3.0
348 NaN 5.0 7.0 5.0 2.0,5.0
441 NaN 4.0 6.0 4.0 3.0,4.0
443 NaN 4.0 6.0 4.0 4.0,7.0
454 NaN 5.0 7.0 5.0 6.0,9.0
460 NaN 4.0 6.0 4.0 6.0,8.0
639 NaN 6.0 9.0 6.0 8.0,10.0
758 NaN 4.0 6.0 4.0 3.0,7.0
787 NaN 2.0 4.0 2.0 2.0,3.0
997 NaN 5.0 7.0 5.0 6.0,7.0,8.0
1066 NaN 6.0 9.0 6.0 7.0,9.0
1092 NaN 5.0 7.0 5.0 4.0,6.0
1212 NaN 3.0 5.0 3.0 1.0,2.0
1249 NaN 3.0 5.0 3.0 3.0,4.0
1272 NaN 5.0 7.0 5.0 8.0,9.0
1294 NaN 6.0 9.0 6.0 8.0,9.0
1552 NaN 5.0 7.0 5.0 9.0,10.0
In [102]:
plt.figure(figsize=(10,8))
sb.lineplot(data=df, x='ProsperRating', y='ProsperScore')
plt.title('Correlation between ProsperRating and ProsperScore (with ProsperScore corrections)');

Notes:

Again this is showing the same lineplot but with ProsperScore corrections. The visuals should not have changed since I am using ProsperRating to correct ProsperScore.

In [103]:
sb.lmplot(data=df.sample(10000), x='ProsperRating', y='BorrowerRate', x_jitter=.05)
#sb.pairplot(df, kind='reg', plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}})
Out[103]:
<seaborn.axisgrid.FacetGrid at 0x7f24bc757f40>

Notes:

I have seen this plot already but as regplot. However, this lmplot is computationally expensive even with a fraction of the data.

In [104]:
t_cols = list(df.columns)
print(t_cols)
['ListingKey', 'EmploymentStatusDuration', 'EmploymentStatus', 'LoanOriginalAmount', 'CreditGrade', 'ProsperRating', 'ProsperScore', 'ProsperScoreRange', 'BorrowerRate', 'CreditScoreRangeUpper', 'CurrentDelinquencies', 'IsBorrowerHomeowner', 'CurrentCreditLines', 'AvailableBankcardCredit', 'ProsperRatingGrade']
In [105]:
t_cols = ['BorrowerRate', 'LoanOriginalAmount', 'CreditScoreRangeUpper', 'ProsperRatingGrade']
In [106]:
sb.pairplot(df[t_cols].sample(10000), kind='reg', plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}})
Out[106]:
<seaborn.axisgrid.PairGrid at 0x7f24be1f7bb0>

Notes:

  • ProsperRatingGrade is positively correlated with CreditScoreRangeUpper.
  • LoanOriginalAmount is positively with CreditScoreRangeUpper
  • This provides information on which relationships to focus on.
In [107]:
sb.catplot(data=df, x='ProsperRatingGrade', y='BorrowerRate', color=pleasant, alpha=0.1)
Out[107]:
<seaborn.axisgrid.FacetGrid at 0x7f24bc3e7ac0>

Notes:

As I mentioned before it didn't make a difference in the big picture.

I drew the graph first with the line and then a form of a scatterplot indicating density of points. It is clear to me now that ProsperRating is ordinal. The higher the ProsperRating the lower the BorrowerRate.

In [108]:
temp_df = df[['BorrowerRate', 'ProsperRatingGrade']]
temp_df.isnull().sum()
Out[108]:
BorrowerRate            0
ProsperRatingGrade    131
dtype: int64
In [109]:
temp_df = temp_df.dropna()
temp_df.isnull().sum()
Out[109]:
BorrowerRate          0
ProsperRatingGrade    0
dtype: int64
In [110]:
temp_df.shape
Out[110]:
(112935, 2)
In [111]:
sb.lmplot(data=temp_df.sample(10000), x='ProsperRatingGrade', y='BorrowerRate', x_jitter=.05, line_kws={'color':'red'})
plt.title('Borrower Rate against Prosper Rating');

Notes:

I used a red line to show better the trend between ProsperRatingGrade and BorrowerRate.

In [112]:
sb.catplot(data=df, x='EmploymentStatus', y='EmploymentStatusDuration', color=pleasant, alpha=0.1)
plt.xticks(rotation=45, horizontalalignment='right')
plt.title('Employment Duration vs Status');

Notes:

I thought there might be a relationship between emploment status and duration. I don't see much.

In [113]:
sb.catplot(data=df, x='EmploymentStatus', y='BorrowerRate', color=pleasant, alpha=0.1)
plt.xticks(rotation=45, horizontalalignment='right');

Notes:

With the relationship more dense in the first 5 statuses, Prosper had a preference to lend to the employed. However, the rates varied just as widely regardless of employment status.

In [114]:
plt.figure(figsize=(10,8))
sb.boxplot(data=df, x='EmploymentStatus', y='BorrowerRate', color=pleasant)
plt.xticks(rotation=45, horizontalalignment='right')
plt.title('Borrower rates among employment statuses');

Notes:

As mentioned above, the rates did vary regardless of employment status. But what is clear here is that the employed enjoyed lower rates on average.

In [115]:
df.EmploymentStatus.value_counts()
Out[115]:
Employed         66598
Full-time        26354
Self-employed     6052
Not available     5347
Other             3742
Part-time         1088
Not employed       835
Retired            795
Name: EmploymentStatus, dtype: int64

Notes:

there's a definite range of rates given to the first 5 statuses.

In [116]:
sb.heatmap(df.corr(), fmt='.2f', annot=True, cmap='RdBu_r')
Out[116]:
<AxesSubplot:>

Notes:

  • Mutual exclusivity of CreditGrade and ProsperRating is quite evident.
  • High correlation between ProsperRating and CreditGrade hints at independence.
  • Low correlation between other features need to be examined more closely.
In [117]:
plt.figure(figsize=(10,8))
sb.lmplot(data=df.sample(frac=.25), x='LoanOriginalAmount', y='BorrowerRate', line_kws={'color': 'red'});
<Figure size 720x576 with 0 Axes>

Notes:

  • There seems to be a general lowering of rates as LoanOriginalAmount increases.
  • Most loans are below 15k.
  • After 15k, there are dense periods at increment of 5000.
In [118]:
plt.figure(figsize=(10,8))
sb.lmplot(data=df.sample(frac=.25), x='CreditScoreRangeUpper', y='BorrowerRate', line_kws={'color': 'red'});
<Figure size 720x576 with 0 Axes>

Notes:

  • The trend is that the higher the credit score, the lower the BorrowerRate.
  • Nearly all borrowers had credit scores above 400.
  • This could be better plotted by limiting the x axis.
In [119]:
df.CreditScoreRangeUpper.isnull().sum()
Out[119]:
591
In [120]:
df.CreditScoreRangeUpper.describe()
Out[120]:
count    112475.000000
mean        704.524961
std          66.635895
min          19.000000
25%         679.000000
50%         699.000000
75%         739.000000
max         899.000000
Name: CreditScoreRangeUpper, dtype: float64

Notes:

  • With those that had a credit scores, it looks that the higher the credit score the better the BorrowerRate.
  • A minimum of 19 in Credit Score is probably an clerical error.
  • Prosper only lent money to people with credit scores above 400.
In [121]:
# computationally expensive using sample

sb.lmplot(x="BorrowerRate", y="IsBorrowerHomeowner", data=df.sample(10000), logistic=True, y_jitter=.03);

Notes:

It looks as if they disbursed funds regardless of homeownership. However, it benefits that you are a homeowner in that the BorrowerRate is lower if you are a homeowner

In [122]:
plt.figure(figsize=(10,8))
sb.lmplot(data=df.sample(frac=.25), x='CurrentCreditLines', y='BorrowerRate', line_kws={'color': 'red'});
<Figure size 720x576 with 0 Axes>

Notes:

  • Most borrowers had credit lines below 25.
  • The trend is that the more credit line a borrower had, the lower the BorrowerRate.
In [136]:
# ProsperRatingGrade vs homeownership
plt.figure(figsize =(10,8))
sb.countplot(data=df, x='ProsperRatingGrade', hue='IsBorrowerHomeowner');

Notes:

Homeownership seemed to matter less as the ProsperRatingGrade got higher. Prosper Rating seemed to matter more

In [139]:
plt.figure(figsize=(10,8))
sb.boxplot(data=df, x='CreditGrade', y='BorrowerRate', color=pleasant)
plt.xticks(rotation=45, horizontalalignment='right')
plt.title('BorrowerRate against CreditGrade');

Notes:

Here is another look at CreditGrade vs BorrowerRate. The mean and the quartiles of BorrowerRate trend down as rating increases.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • BorrowerRate correlates with CreditScoreRangeUpper: the better the score the lower the rate.
  • BorrowerRate correlates with ProsperRating: the higher the rating the lower the rate.
  • BorrowerRate slightly correlates with loan amount: the higher the amount the lower the rate. that is interesting.
  • It looks as if they disbursed funds regardless of homeownership. However, it benefits that you are a homeowner in that the BorrowerRate is lower if you are a homeowner.
  • Most borrowers have fewer than 25 credit lines. But the tendency is that the more credit line you have, the lower the BorrowerRate.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • CreditScoreRangeUpper is positively correlated with LoanOriginalAmount.
  • Homeownership seemed to matter less as the ProsperRatingGrade got higher. Prosper Rating seemed to matter more.

Multivariate Exploration

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

In [125]:
g = sb.FacetGrid(data=df, height=6, col='ProsperRatingGrade', col_wrap=2)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend()
g.fig.suptitle('Relationship between Loan Amount and BorrowerRate broken down by Prosper Rating Grade');

Notes:

It looks that the higher Prosper Rating Grade have the lowest range of borrower rates.

In [97]:
g = sb.FacetGrid(data=df, height=6, col='ProsperRatingGrade', col_wrap=4)
g.map(sb.regplot, 'CreditScoreRangeUpper', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1}, line_kws={'color':'red'});
g.add_legend()
g.fig.suptitle('Relationship between Credit Score and BorrowerRate broken down by Prosper Rating Grade');

Notes:

Credit Score made more of a difference in the big picture with borrowers with Prosper Rating grade of 4 or below.

In [95]:
g = sb.FacetGrid(data=df, height=6, col='ProsperRatingGrade', col_wrap=4)
g.map(sb.regplot, 'AvailableBankcardCredit', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend()
g.fig.suptitle('Relationship between Available Bank Credit and BorrowerRate broken down by Prosper Rating Grade');

Notes:

Interestingly, people with Prosper Rating of 5 or higher had a negatively correlationship between their rate and available bank credit. With those with rating of 4 or below, the relationship between rate and available credit is positively correlated. The higher available bank credit the lower the borrower rate. But if the borrower's rating is low, the higher the borrower rate is as bank credit increases. Again this might have the effect of other features which are causing the rating to lower.

In [101]:
# see the relationship of credit scores broken down # currentCreditLines CurrentDelinquencies
g = sb.FacetGrid(data=df, height=6, col='ProsperRatingGrade', col_wrap=4)
g.map(sb.regplot, 'CurrentDelinquencies', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1}, line_kws={'color':'red'});
g.add_legend()
g.fig.suptitle('Relationship between Available Bank Credit and BorrowerRate broken down by Prosper Rating Grade');

Notes:

CurrentDelinquiencies seem to have an effect on BorrowerRate. Generally, the more deliquencies the higher the BorrowerRate. Interestingly, at low grade, the BorrowerRate seems to go lower. I think this is just the effect of having low ProsperRating.

In [98]:
g = sb.FacetGrid(data=df, height=6, col='ProsperRatingGrade', col_wrap=4)
g.map(sb.regplot, 'CurrentCreditLines', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1}, line_kws={'color':'red'});
g.add_legend()
g.fig.suptitle('Relationship between number of credit lines and BorrowerRate broken down by Prosper Rating Grade');

Notes:

Broken down by prosper rating, the trend between BorrowerRate and number of credit lines are generally flat. However, at a rating of 1, the relationship between BorrowerRate and CurrentCreditLines is positive, i.e., at that low rating, the more credit lines the higher the rate. Interesting!

In [99]:
g = sb.FacetGrid(data=df, height=6, col='CreditGrade', col_wrap=4)
g.map(sb.regplot, 'CreditScoreRangeUpper', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1}, line_kws={'color':'red'});
g.add_legend()
g.fig.suptitle('Relationship between pre-2009 credit grade and BorrowerRate broken down by Grade');

Notes:

This graph and the graph below are probably the most revealing during this whole investigation. This is BorrowerRate against credit scores broken down by each CreditGrade. This is where I discovered that CreditGrade is tied very closely to CreditScoreRangeUpper. Note the range of credit scores per grade.

In [100]:
g = sb.FacetGrid(data=df, height=6, col='ProsperRating', col_wrap=4)
g.map(sb.regplot, 'CreditScoreRangeUpper', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1}, line_kws={'color':'red'});
g.add_legend()
g.fig.suptitle('Relationship between post-2009 Prosper Rating and BorrowerRate broken down by Grade');

Notes:

This is extremely interesting. The pre-2009 CreditGrade is very closely tied to the CreditScoreRangeUpper. On the other hand, post-2009 was less dependent on Credit Score.

This puts to question of combining CreditGrade and ProsperRating into one column.

However, upon looking at range of BorrowerRate, the density are comparable.

Rough visual assessment of range of BorrowerRate by Grade

Grade CreditGrade ProsperRating
0 0 0
1 .24 - .26 .32
2 .22 - .32 .28
3 .21 .24 - .26
4 .16 and .18 .18 - .22
5 .14 .16
6 .1 - .15 .12
7 .07 - .1 .07
In [142]:
#fig, ax = plt.subplots(1, 2, figsize=(12, 10))
#sb.boxplot(x='CreditGrade', y='BorrowerRate', data=df, ax=[1, 1])
#sb.boxplot(x='ProsperRating', y='BorrowerRate', data=df, ax=[1, 2])
In [166]:
# create a boolean mask IsCG IsPR
# create a column IsPR
wdf = df.copy()
In [167]:
wdf['IsPR'] = np.where(wdf.ProsperRating.notnull(), 1, None) # column IsPR
wdf['IsCG'] = np.where(wdf.CreditGrade.notnull(), 2, None) # column 
wdf['post2009'] = wdf['IsPR']
In [168]:
wdf.post2009.update(wdf['IsCG'])
wdf.post2009.isnull().sum()
Out[168]:
131
In [169]:
wdf[['IsPR', 'ProsperRating', 'IsCG', 'CreditGrade', 'post2009']].sample(1300).head(20)
Out[169]:
IsPR ProsperRating IsCG CreditGrade post2009
2730 None NaN 2 5.0 2
9124 None NaN 2 2.0 2
24834 1 1.0 None NaN 1
83390 None NaN 2 1.0 2
103325 1 4.0 None NaN 1
37703 1 6.0 None NaN 1
72512 1 6.0 None NaN 1
95756 1 4.0 None NaN 1
93417 None NaN 2 2.0 2
108788 1 4.0 None NaN 1
53287 None NaN 2 7.0 2
29343 1 1.0 None NaN 1
91386 1 7.0 None NaN 1
50087 1 4.0 None NaN 1
85562 1 3.0 None NaN 1
12304 1 3.0 None NaN 1
65070 1 4.0 None NaN 1
60038 1 4.0 None NaN 1
16582 None NaN 2 2.0 2
6262 None NaN 2 4.0 2
In [170]:
wdf = wdf.dropna(subset=['post2009'])
wdf.isnull().sum()
Out[170]:
ListingKey                       0
EmploymentStatusDuration      7625
EmploymentStatus              2255
LoanOriginalAmount               0
CreditGrade                  83982
ProsperRating                28953
ProsperScore                 28953
ProsperScoreRange           112108
BorrowerRate                     0
CreditScoreRangeUpper          591
CurrentDelinquencies           697
IsBorrowerHomeowner              0
CurrentCreditLines            7604
AvailableBankcardCredit       7544
ProsperRatingGrade               0
IsPR                         28953
IsCG                         83982
post2009                         0
dtype: int64
In [173]:
wdf.post2009.replace(2, 0, inplace=True)
In [174]:
wdf[['IsPR', 'ProsperRating', 'IsCG', 'CreditGrade', 'post2009']].sample(1300).head(20)
Out[174]:
IsPR ProsperRating IsCG CreditGrade post2009
106247 None NaN 2 7.0 0
112105 None NaN 2 4.0 0
35026 1 5.0 None NaN 1
36759 None NaN 2 2.0 0
35785 1 3.0 None NaN 1
33607 1 3.0 None NaN 1
65193 1 2.0 None NaN 1
38489 None NaN 2 4.0 0
65048 None NaN 2 4.0 0
30086 1 6.0 None NaN 1
25623 1 4.0 None NaN 1
6246 None NaN 2 3.0 0
68600 1 5.0 None NaN 1
96754 1 3.0 None NaN 1
84674 1 3.0 None NaN 1
26480 1 5.0 None NaN 1
39898 None NaN 2 4.0 0
69948 1 2.0 None NaN 1
92785 1 3.0 None NaN 1
94582 None NaN 2 4.0 0
In [176]:
plt.figure(figsize=(10,8))
sb.boxplot(x='ProsperRatingGrade', y='BorrowerRate', hue='post2009', data=wdf)
plt.title('Does it work');

Notes:

  • This puts the 2 important discoveries I had about BorrowerRate against both CreditGrade (pre-2009) and ProsperRating (post-200
  • The quartiles and mean of BorrowerRate trend downward as rating increases.
  • However, post-2009 the rating was more determinant of the BorrowerRate.
  • Pre-2009, the rates varied more across ratings.
In [177]:
plt.figure(figsize=(10,8))
sb.boxplot(x='ProsperRatingGrade', y='CreditScoreRangeUpper', hue='post2009', data=wdf)
plt.title('Does it work');

Notes:

  • Note that pre-2009, the credit scores against ratings are upward and non-overlapping.
  • Note that post-2009, the credit scores against ratings are overlapping.
  • I created a new column called post2009 indicating whether the ProsperRatingGrade is from ProsperRating (post-2009) or from CreditGrade (pre-2009).
  • I am trying to determine if the range of borrower rate by ProsperRatingGrade warranted my having combined them into one column.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

It is clear that not one factor determines the BorrowerRate or the ProsperRatingGrade.

  • CurrentDelinquiencies seem to have an effect on BorrowerRate. Generally, the more deliquencies the higher the BorrowerRate. Interestingly, at low grade, the BorrowerRate seems to go lower. I think this is just the effect of having low ProsperRating.
  • Interestingly, people with Prosper Rating of 5 or higher had a negatively correlationship between their rate and available bank credit. With those with rating of 4 or below, the relationship between rate and available credit is positively correlated. The higher available bank credit the lower the borrower rate. But if the borrower's rating is low, the higher the borrower rate is as bank credit increases. Again this might have the effect of other features which are causing the rating to lower.

Were there any interesting or surprising interactions between features?

  • This is extremely interesting. The pre-2009 CreditGrade is very closely tied to the CreditScoreRangeUpper. On the other hand, post-2009 was less dependent on Credit Score.
  • This puts to question of combining CreditGrade and ProsperRating into one column. However, upon looking at range of BorrowerRate, the density are comparable.
  • I think the CreditGrade could be scaled to match more of ProsperRating before merging but that is beyond the scope of this project
  • It is interesting to note, that features (other than ProsperRating) make more of a difference when the ProsperRating is low.

At the end of your report, make sure that you export the notebook as an html file from the File > Download as... > HTML menu. Make sure you keep track of where the exported file goes, so you can put it in the same folder as this notebook for project submission. Also, make sure you remove all of the quote-formatted guide notes like this one before you finish your report!

In [ ]: